
Remove redundant _filter_ingested from streaming pipeline#271

Merged
gvanrossum merged 1 commit into microsoft:main from
KRRT7:fix/remove-filter-ingested
May 5, 2026

Conversation

@KRRT7 (Contributor) commented May 5, 2026

Summary

  • Removes _filter_ingested method and all per-batch are_sources_ingested DB queries from add_messages_streaming
  • The streaming API is now a dumb pipe — callers handle dedup in their generators
  • ingest_email.py already pre-filters via is_source_ingested before yielding
  • ingest_vtt.py enforces a fresh DB (refuses to run if it exists)
  • ingest_podcast.py uses unique source_ids by construction
  • Removes dead batch_skipped counter from ingest_email.py
  • Net -213 lines

Why: on a 10k-email re-ingest with batch_size=100, the framework was issuing ~100 are_sources_ingested queries that always returned empty sets (because the caller already filtered). Pure waste.
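The "dumb pipe" shape described above can be sketched as a batcher that consumes whatever the caller yields, with no dedup queries of its own. This is an illustrative sketch, not the real `add_messages_streaming` signature:

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def batched(messages: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    # Dumb-pipe batching: yield fixed-size batches of whatever the caller
    # sends, issuing no are_sources_ingested (or any other) dedup queries.
    # Filtering already happened upstream, in the caller's generator.
    it = iter(messages)
    while batch := list(islice(it, batch_size)):
        yield batch
```

On a 10k-message stream with `batch_size=100` this yields 100 batches and performs zero DB round-trips, versus one `are_sources_ingested` query per batch before this change.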

Stack

This is the base of a 3-PR stack. Merge order: #271 → #267 → #268

| # | PR | Branch | Description |
|---|---|---|---|
| 1 | **#271** (this) | `fix/remove-filter-ingested` | Remove framework dedup |
| 2 | #267 | `feat/vtt-streaming-ingestion` | VTT streaming migration |
| 3 | #268 | `refactor/email-dedup-consolidation` | Email bulk dedup |

Test plan

  • make check passes (pyright 0 errors)
  • pytest tests/ passes (696 tests; 5 tests that covered only framework-level dedup were removed)
  • Removed tests verified as testing only the deleted behavior (pre-marking source_ids then expecting framework to skip them)

Closes #269

The framework no longer does per-batch dedup queries inside
add_messages_streaming. Callers are responsible for filtering
duplicates before yielding messages into the stream.

- ingest_email.py already pre-filters via is_source_ingested
- ingest_vtt.py enforces a fresh DB (refuses existing)
- podcast_ingest.py uses unique source_ids by construction

This eliminates ~N unnecessary are_sources_ingested DB round-trips
(one per batch) that always returned empty sets.

Closes microsoft#269
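The caller-side contract the commit describes can be sketched as a pre-filtering generator. `Entry`, `FakeStore`, and the method names here are stand-ins for the real ingest_email.py / storage-provider API, used only to show the shape:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class Entry:
    source_id: str

class FakeStore:
    # Stand-in for the real storage provider; holds already-ingested ids.
    def __init__(self, ingested: set[str]) -> None:
        self._ingested = ingested

    def is_source_ingested(self, source_id: str) -> bool:
        return source_id in self._ingested

def iter_new_entries(
    entries: Iterable[Entry], store: FakeStore
) -> Iterator[Entry]:
    # Caller-side dedup: skip known sources *before* yielding, so the
    # streaming pipeline downstream never needs its own filtering pass.
    for entry in entries:
        if not store.is_source_ingested(entry.source_id):
            yield entry
```

Because duplicates never enter the stream, the framework's per-batch `are_sources_ingested` queries could only ever return empty sets, which is why they were removed.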
@KRRT7 force-pushed the fix/remove-filter-ingested branch from 1c71230 to 3875789 on May 5, 2026 at 07:24
KRRT7 added a commit to KRRT7/typeagent-py that referenced this pull request May 5, 2026
The framework no longer populates messages_skipped (removed in microsoft#271),
so the skipped counter and conditional output are dead code.
@gvanrossum merged commit a058b2f into microsoft:main May 5, 2026
16 checks passed
bmerkle pushed a commit that referenced this pull request May 5, 2026
…or (#268)

## Summary

**Upfront bulk dedup** (commit 1):
- Replaces N per-file `is_source_ingested` DB queries with a single
upfront `are_sources_ingested` bulk query + in-memory `set` lookup
- `_email_generator` now takes a pre-collected `email_entries` list and
`already_ingested: set[str]`, skipping known files before parsing
(avoids wasted parse cost)
- Single `_iter_emails` call feeds both the bulk pre-filter and the
generator
- Removed dead `messages_skipped` tracking from batch callback (always 0
since #271)
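A minimal sketch of the upfront bulk dedup, assuming a hypothetical store whose `are_sources_ingested` returns the already-known subset in a single round-trip (the real `IStorageProvider` interface may differ):

```python
class BulkStore:
    # Stand-in store with a single bulk membership query.
    def __init__(self, ingested: set[str]) -> None:
        self._ingested = ingested
        self.query_count = 0  # track round-trips for illustration

    def are_sources_ingested(self, source_ids: list[str]) -> set[str]:
        self.query_count += 1
        return self._ingested & set(source_ids)

def split_new_and_known(
    source_ids: list[str], store: BulkStore
) -> tuple[list[str], set[str]]:
    # One upfront bulk query replaces N per-file is_source_ingested calls;
    # every later membership check is an in-memory set lookup, and known
    # files are skipped before any parsing work is spent on them.
    already = store.are_sources_ingested(source_ids)
    fresh = [s for s in source_ids if s not in already]
    return fresh, already
```

The key property is that `query_count` stays at 1 regardless of how many files are scanned, matching the "bulk query fires once before the generator runs" behavior in the test below.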

**Cleanup**:
- Removes `_flush_skipped`, skip tracking variables, unused
`IStorageProvider` import
- Net -49 lines

## Stack

**This PR is stacked on #267. Merge order: #271 → #267 → #268**

| # | PR | Branch | Description |
|---|---|---|---|
| 1 | #271 | `fix/remove-filter-ingested` | Remove framework dedup |
| 2 | #267 | `feat/vtt-streaming-ingestion` | VTT streaming migration |
| 3 | **#268** (this) | `refactor/email-dedup-consolidation` | Email bulk dedup |

## Reproducible test

First ingest (populates DB):
```
mkdir -p /tmp/test_eml
# create 2 .eml files (any valid RFC 5322 .eml will do)
uv run python tools/ingest_email.py /tmp/test_eml/ -d /tmp/dedup_test.db -v
```

Re-ingest (should skip all):
```
uv run python tools/ingest_email.py /tmp/test_eml/ -d /tmp/dedup_test.db -v
```

Expected output on re-ingest:
```
Pre-filter: 2 of 2 already ingested

Parsing and importing emails...

Successfully ingested 0 email(s)
Ingested 0 chunk(s)
Skipped 2 already-ingested email(s)
Extracted 0 semantic references
Total time: 0.0s
```

Key behavior: the bulk query fires once before the generator runs. No
emails are parsed on re-ingest (0.0s total time).

## Test plan

- [x] `make format check test` passes (696 tests, 0 pyright errors)
- [x] Manual test: re-ingest same .eml files, verify all skipped via
upfront bulk dedup
- [x] Manual test: ingest with `--verbose`, verify skip reporting in
summary

---------

Co-authored-by: Guido van Rossum <gvanrossum@gmail.com>


Linked issue

_filter_ingested runs redundant DB queries when the caller already pre-filters
